Data preprocessing and Feature Engineering in ML

Manish Patel

2023-05-16

Feature Engineering

Feature engineering is the ‘art’ of constructing useful features from existing data, guided by the target to be learned and the machine learning model being used.

FEATURE ENGINEERING PIPELINE

Why?

DEMONSTRATION

import pandas as pd
data = {'Candy Variety': ['Chocolate Hearts', 'Sour Jelly', 'Candy Canes', 'Sour Jelly', 'Fruit Drops'],
        'Date and Time': ['09-02-2020 14:05', '24-10-2020 18:00', '18-12-2020 20:13', '25-10-2020 10:00', '18-10-2020 15:46'],
        'Day': ['Sunday', 'Saturday', 'Friday', 'Sunday', 'Sunday'],
        'Length': [3, 3.5, 3.5, 3.5, 5],
        'Breadth': [2, 2, 2.5, 2, 3],
        'Price': [7.5, 7.6, 8, 7.6, 9]}
df = pd.DataFrame(data)
df['Date and Time'] = pd.to_datetime(df['Date and Time'], format="%d-%m-%Y %H:%M")
df
Candy Variety Date and Time Day Length Breadth Price
0 Chocolate Hearts 2020-02-09 14:05:00 Sunday 3.0 2.0 7.5
1 Sour Jelly 2020-10-24 18:00:00 Saturday 3.5 2.0 7.6
2 Candy Canes 2020-12-18 20:13:00 Friday 3.5 2.5 8.0
3 Sour Jelly 2020-10-25 10:00:00 Sunday 3.5 2.0 7.6
4 Fruit Drops 2020-10-18 15:46:00 Sunday 5.0 3.0 9.0

Which kind of candy is most likely to sell the most on a given day?

df['Date']=df['Date and Time'].dt.date
df[['Candy Variety','Date']]
Candy Variety Date
0 Chocolate Hearts 2020-02-09
1 Sour Jelly 2020-10-24
2 Candy Canes 2020-12-18
3 Sour Jelly 2020-10-25
4 Fruit Drops 2020-10-18

Feature Engineering in action

import numpy as np
df['Weekend'] = np.where(df['Day'].isin(['Saturday', 'Sunday']), 1, 0)
df[['Candy Variety','Date','Weekend']]
Candy Variety Date Weekend
0 Chocolate Hearts 2020-02-09 1
1 Sour Jelly 2020-10-24 1
2 Candy Canes 2020-12-18 0
3 Sour Jelly 2020-10-25 1
4 Fruit Drops 2020-10-18 1

FEATURE ENGINEERING TECHNIQUES

1) Imputation

Imputation deals with handling missing values in data.

  • Categorical Imputation: Missing categorical values are generally replaced by the most frequently occurring value among the other records.
  • Numerical Imputation: Missing numerical values are generally replaced by the mean of that feature computed over the other records.

Add missing values

data = {'Candy Variety': ['Chocolate Hearts', 'Sour Jelly', 'Candy Canes', 'Sour Jelly', 'Fruit Drops'],
        'Date and Time': ['09-02-2020 14:05', '24-10-2020 18:00', '18-12-2020 20:13', '25-10-2020 10:00', '18-10-2020 15:46'],
        'Day': ['Sunday', 'Saturday', 'Friday', 'Sunday', 'Sunday'],
        'Length': [3, 3.5, 3.5, 3.5, 5],
        'Breadth': [2, 2, 2.5, 2, 3],
        'Price': [7.5, 7.6, 8, 7.6, 9]}
df = pd.DataFrame(data)
df['Date and Time'] = pd.to_datetime(df['Date and Time'], format="%d-%m-%Y %H:%M")

# Appending a row with missing values (np.nan, not the removed np.NaN alias)
df.loc[len(df.index)] = [np.nan, '22-10-2020 17:24:00', 'Thursday', 3.5, 2, np.nan]
df
Candy Variety Date and Time Day Length Breadth Price
0 Chocolate Hearts 2020-02-09 14:05:00 Sunday 3.0 2.0 7.5
1 Sour Jelly 2020-10-24 18:00:00 Saturday 3.5 2.0 7.6
2 Candy Canes 2020-12-18 20:13:00 Friday 3.5 2.5 8.0
3 Sour Jelly 2020-10-25 10:00:00 Sunday 3.5 2.0 7.6
4 Fruit Drops 2020-10-18 15:46:00 Sunday 5.0 3.0 9.0
5 NaN 22-10-2020 17:24:00 Thursday 3.5 2.0 NaN

SCIKIT LEARN IMPUTER

from numpy import nan
X = np.array([[ nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   nan, 6  ],
              [ 8,   8,   1  ]])
y = np.array([14, 16, -1,  8, -5])
  • For a baseline imputation approach using the mean, median, or most frequent value, Scikit-Learn provides the SimpleImputer class:

Mean Strategy

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])
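Mean is not the only option; as a quick sketch on the same matrix, the `median` and `most_frequent` strategies behave as follows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]])

# strategy='median' fills column 0 with the median of [3, 3, 4, 8] = 3.5
X_median = SimpleImputer(strategy='median').fit_transform(X)

# strategy='most_frequent' fills column 0 with 3, its most common value
X_mode = SimpleImputer(strategy='most_frequent').fit_transform(X)
```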

Drawback

Simple imputation can be misleading: filling the missing variety with the mode ('Sour Jelly') and the missing price with the overall mean produces a Sour Jelly priced at 7.94, even though every other Sour Jelly record shows 7.6.

df['Candy Variety']=df['Candy Variety'].fillna(df['Candy Variety'].mode()[0])
df['Price']=df['Price'].fillna(df['Price'].mean())
df
Candy Variety Date and Time Day Length Breadth Price
0 Chocolate Hearts 2020-02-09 14:05:00 Sunday 3.0 2.0 7.50
1 Sour Jelly 2020-10-24 18:00:00 Saturday 3.5 2.0 7.60
2 Candy Canes 2020-12-18 20:13:00 Friday 3.5 2.5 8.00
3 Sour Jelly 2020-10-25 10:00:00 Sunday 3.5 2.0 7.60
4 Fruit Drops 2020-10-18 15:46:00 Sunday 5.0 3.0 9.00
5 Sour Jelly 22-10-2020 17:24:00 Thursday 3.5 2.0 7.94

2) Discretization

Discretization takes a set of data values and groups them together in some logical fashion into bins (or buckets).

METHODS

  1. Grouping into equal-width intervals
  2. Grouping into equal-frequency intervals (the same number of observations per bin)
  3. Grouping based on decision trees (to establish a relationship with the target)
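The first two methods can be sketched with pandas' `cut` and `qcut`, using the Price values from the toy data:

```python
import pandas as pd

prices = pd.Series([7.5, 7.6, 8.0, 7.6, 9.0])

# Equal-width intervals: the range [7.5, 9.0] is split into two bins of equal width
equal_width = pd.cut(prices, bins=2, labels=['low', 'high'])

# Equal-frequency intervals: each bin holds roughly the same number of observations
equal_freq = pd.qcut(prices, q=2, labels=['low', 'high'])
```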

Example

df['Type of Day']=np.where(df['Day'].isin(['Saturday', 'Sunday']), 'Weekend', 'Weekday')
df[['Candy Variety','Day','Type of Day']]
Candy Variety Day Type of Day
0 Chocolate Hearts Sunday Weekend
1 Sour Jelly Saturday Weekend
2 Candy Canes Friday Weekday
3 Sour Jelly Sunday Weekend
4 Fruit Drops Sunday Weekend
5 Sour Jelly Thursday Weekday

3) Categorical Encoding

Categorical encoding is the technique used to encode categorical features into numerical values which are usually simpler for an algorithm to understand.

One-hot encoding (OHE)

  • Simply adds a new 0/1 feature for every category, having 1 (hot) if the sample has that category
  • Can explode if a feature has lots of values, causing issues with high dimensionality
  • What if test set contains a new category not seen in training data?
    • Either ignore it (just use all 0’s in row), or handle manually (e.g. resample)

EXAMPLE

for x in df['Type of Day'].unique():
    df[x]=np.where(df['Type of Day']==x,1,0)
df[['Candy Variety','Day','Type of Day','Weekend','Weekday']]
Candy Variety Day Type of Day Weekend Weekday
0 Chocolate Hearts Sunday Weekend 1 0
1 Sour Jelly Saturday Weekend 1 0
2 Candy Canes Friday Weekday 0 1
3 Sour Jelly Sunday Weekend 1 0
4 Fruit Drops Sunday Weekend 1 0
5 Sour Jelly Thursday Weekday 0 1

Drawback

  • It could dramatically increase the number of features and create highly correlated features.
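In practice, the manual loop above is usually replaced by pandas' built-in get_dummies; its drop_first option keeps k-1 columns for k categories, which mitigates the correlated-features drawback:

```python
import pandas as pd

days = pd.Series(['Weekend', 'Weekday', 'Weekend'])

dummies = pd.get_dummies(days)                   # one 0/1 column per category
reduced = pd.get_dummies(days, drop_first=True)  # k-1 columns for k categories
```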

Other methods

  1. Count and frequency encoding: captures each label’s representation in the data.
  2. Mean encoding: establishes a relationship with the target.
  3. Ordinal encoding: assigns a number to each unique label.
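The first and third of these can be sketched in plain pandas (mean encoding additionally needs a target column):

```python
import pandas as pd

varieties = pd.Series(['Sour Jelly', 'Candy Canes', 'Sour Jelly', 'Fruit Drops'])

# Count encoding: each label is replaced by how often it occurs
count_encoded = varieties.map(varieties.value_counts())

# Ordinal encoding: each unique label gets an integer code
ordinal_encoded = varieties.astype('category').cat.codes
```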

4) Feature Splitting

Splitting features into parts can sometimes improve the value of the features toward the target to be learned.

EXAMPLE

df['Date and Time'] = pd.to_datetime(df['Date and Time'], dayfirst=True)
df['Date']=df['Date and Time'].dt.date
df[['Candy Variety','Date']]
Candy Variety Date
0 Chocolate Hearts 2020-02-09
1 Sour Jelly 2020-10-24
2 Candy Canes 2020-12-18
3 Sour Jelly 2020-10-25
4 Fruit Drops 2020-10-18
5 Sour Jelly 2020-10-22

5) Handling Outliers

Outliers are unusually high or low values in the dataset which are unlikely to occur in normal scenarios.

Since these outliers could adversely affect your prediction they must be handled appropriately. The various methods of handling outliers include:

  • Removal: The records containing outliers are removed from the distribution. However, when outliers occur across multiple variables, this method can discard a large portion of the dataset.
  • Replacing values: The outliers could alternatively be treated as missing values and replaced using appropriate imputation.
  • Capping: Capping the maximum and minimum values and replacing them with an arbitrary value or a value from a variable distribution.
  • Discretization
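As a sketch of the capping approach, values outside the usual 1.5 × IQR whiskers can be clipped to the boundary (the 42.0 entry is an artificial outlier added for illustration):

```python
import pandas as pd

prices = pd.Series([7.5, 7.6, 8.0, 7.6, 9.0, 42.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: values beyond the whiskers are replaced by the boundary value
capped = prices.clip(lower, upper)
```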

6) Variable Transformations

Variable transformation techniques could help with normalizing skewed data.

  • Logarithmic transformations compress the larger numbers and relatively expand the smaller numbers. This results in less skewed values, especially for heavy-tailed distributions.
  • Other variable transformations include the square-root transformation and the Box-Cox transformation, which generalizes both.
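A minimal sketch of the log transform on an artificially heavy-tailed array (np.log1p computes log(1 + x), which is safe for zeros); scipy.stats.boxcox provides the Box-Cox generalization:

```python
import numpy as np

skewed = np.array([1.0, 2.0, 3.0, 1000.0])  # heavy right tail

log_transformed = np.log1p(skewed)   # compresses the large value
sqrt_transformed = np.sqrt(skewed)   # a milder alternative
```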

7) Scaling

Feature scaling is done owing to the sensitivity of some machine learning algorithms to the scale of the input values. This technique of feature scaling is sometimes referred to as feature normalization.


TYPES OF SCALING

The commonly used processes of scaling include:

  • Min-Max Scaling: This process involves the rescaling of all values in a feature in the range 0 to 1. In other words, the minimum value in the original range will take the value 0, the maximum value will take 1 and the rest of the values in between the two extremes will be appropriately scaled.
  • Standardization/Variance scaling: Each data point is reduced by the mean and the result is divided by the distribution’s standard deviation, yielding a distribution with mean 0 and variance 1.

EXAMPLE

Min-max scaling

  • Scales all features between a given \(min\) and \(max\) value (e.g. 0 and 1)
  • Makes sense if min/max values have meaning in your data
  • Sensitive to outliers

\[\mathbf{x}_{new} = \frac{\mathbf{x} - x_{min}}{x_{max} - x_{min}} \cdot (max - min) + min \]
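With min = 0 and max = 1, the formula above reduces to (x − x_min)/(x_max − x_min); applied to the Length values:

```python
import numpy as np

x = np.array([3.0, 3.5, 3.5, 3.5, 5.0])  # the Length column

x_new = (x - x.min()) / (x.max() - x.min())  # min maps to 0, max to 1
```

scikit-learn's MinMaxScaler does the same column-wise, and remembers x_min and x_max so new data can be transformed consistently.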

Robust scaling

  • Subtracts the median, scales between quantiles \(q_{25}\) and \(q_{75}\)
  • New feature has median 0, \(q_{25}=-1\) and \(q_{75}=1\)
  • Similar to standard scaler, but ignores outliers

from sklearn.preprocessing import RobustScaler
X_robust = RobustScaler().fit_transform(X2)  # median-centred, IQR-scaled

8) Creating Features

Feature creation involves deriving new features from existing ones.

This can be done with simple mathematical operations such as aggregations (mean, median, mode), sums, differences, or even the product of two values.

WORKING EXAMPLE

data = {'Candy Variety': ['Chocolate Hearts', 'Sour Jelly', 'Candy Canes', 'Sour Jelly', 'Fruit Drops'],
        'Date and Time': ['09-02-2020 14:05', '24-10-2020 18:00', '18-12-2020 20:13', '25-10-2020 10:00', '18-10-2020 15:46'],
        'Day': ['Sunday', 'Saturday', 'Friday', 'Sunday', 'Sunday'],
        'Length': [3, 3.5, 3.5, 3.5, 5],
        'Breadth': [2, 2, 2.5, 2, 3],
        'Price': [7.5, 7.6, 8, 7.6, 9]}
df = pd.DataFrame(data)
df
Candy Variety Date and Time Day Length Breadth Price
0 Chocolate Hearts 09-02-2020 14:05 Sunday 3.0 2.0 7.5
1 Sour Jelly 24-10-2020 18:00 Saturday 3.5 2.0 7.6
2 Candy Canes 18-12-2020 20:13 Friday 3.5 2.5 8.0
3 Sour Jelly 25-10-2020 10:00 Sunday 3.5 2.0 7.6
4 Fruit Drops 18-10-2020 15:46 Sunday 5.0 3.0 9.0

SIMPLE LINEAR REGRESSION

import matplotlib.pyplot as plt
import numpy as np

def simple_linear_regression(x, y):
    # number of observations
    n = np.size(x)
    
    mean_x = np.mean(x)
    mean_y = np.mean(y)
 

    # sum of cross-deviations of y and x
    ss_xy = np.sum(y*x) - n*mean_y*mean_x
    # sum of squared deviations of x
    ss_xx = np.sum(x*x) - n*mean_x*mean_x

    # calculating slope
    m = ss_xy / ss_xx

    # calculating intercept
    c = mean_y - m*mean_x

    return m, c

LENGTH PARAMETER

x=df['Length'].to_numpy()
y=df['Price'].to_numpy()

m,c = simple_linear_regression(x,y)
y_pred = c + m*x

plt.plot(x, y_pred , color = "g", label='Price Prediction')
plt.scatter(df['Length'].to_numpy() , y, marker='1', label='Training set')
plt.xlabel('Length')
plt.ylabel('Price')
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

BREADTH PARAMETER

x=df['Breadth'].to_numpy()
y=df['Price'].to_numpy()

m,c = simple_linear_regression(x,y)
y_pred = c + m*x

plt.plot(x, y_pred , color = "g", label='Price Prediction')
plt.scatter(df['Breadth'].to_numpy() , y, marker='1', label='Training set')
plt.xlabel('Breadth')
plt.ylabel('Price')
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

FEATURE ENGINEERING

df['Size']=df['Breadth']*df['Length']
df[['Candy Variety','Price', 'Size']]
Candy Variety Price Size
0 Chocolate Hearts 7.5 6.00
1 Sour Jelly 7.6 7.00
2 Candy Canes 8.0 8.75
3 Sour Jelly 7.6 7.00
4 Fruit Drops 9.0 15.00

SIZE PARAMETER

x=df['Size'].to_numpy()
y=df['Price'].to_numpy()

m,c = simple_linear_regression(x,y)
y_pred = c + m*x

plt.plot(x, y_pred , color = "g", label='Price Prediction')
plt.scatter(df['Size'].to_numpy() , y, marker='1', label='Training set')
plt.xlabel('Size')
plt.ylabel('Price')
plt.legend(bbox_to_anchor=(1, 1))
plt.show()
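One quick way to see why the engineered Size feature helps: on this toy data, its correlation with Price is higher than that of either raw dimension (a sanity check, not a substitute for proper evaluation):

```python
import numpy as np

length = np.array([3.0, 3.5, 3.5, 3.5, 5.0])
breadth = np.array([2.0, 2.0, 2.5, 2.0, 3.0])
price = np.array([7.5, 7.6, 8.0, 7.6, 9.0])
size = length * breadth

# Pearson correlation of each feature with Price
r_length = np.corrcoef(length, price)[0, 1]
r_breadth = np.corrcoef(breadth, price)[0, 1]
r_size = np.corrcoef(size, price)[0, 1]
```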